Conversation
Pull request overview
This PR refactors and extends the instrumental-variables (IV) pipeline by integrating an IV-LLM-based instrument discovery flow into CAIS, adding the supporting IV-LLM package code, and updating related tooling/tests.
Changes:
- Added an `iv_discovery_tool` (and Pydantic I/O models) plus an `IVDiscovery` component that runs an IV-LLM pipeline (hypothesizer + critics).
- Introduced an `iv_llm` subpackage (prompts, agents, critics, utilities) and logging setup for IV-LLM runs.
- Updated agent/tool wiring and dataset cleaning execution behavior; added/updated tests.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| tests/cais/test_e2e_iv_new.py | New E2E IV test covering the updated IV pipeline (real LLM calls). |
| tests/cais/methods/test_diff_in_diff.py | New unit test intended to validate DiD estimator output structure. |
| cais/utils/llm_helpers.py | Refactors LLM base model import and adds invoke_llm helper. |
| cais/utils/agent.py | Adds a LangChain ReAct agent executor + custom output parser (LCEL-style agent). |
| cais/tools/iv_discovery_tool.py | New tool that discovers instruments via IV-LLM and updates Variables.instrument_variable. |
| cais/tools/controls_selector_tool.py | Enables LangChain @tool decorator and aligns imports. |
| cais/models.py | Adds IVDiscoveryInput / IVDiscoveryOutput models. |
| cais/iv_llm/src/variable_utils.py | Adds helpers for extracting/mapping LLM-proposed variable names to dataset columns. |
| cais/iv_llm/src/prompts/prompt_loader.py | Adds prompt loading/formatting helper for IV-LLM prompts. |
| cais/iv_llm/src/prompts/independence_critic.txt | Adds IV independence-critic prompt template. |
| cais/iv_llm/src/prompts/hypothesizer.txt | Adds IV hypothesizer prompt template with “only from columns list” constraint. |
| cais/iv_llm/src/prompts/exclusion_critic.txt | Adds IV exclusion-restriction prompt template. |
| cais/iv_llm/src/prompts/confounder_miner.txt | Adds confounder miner prompt template with “only from columns list” constraint. |
| cais/iv_llm/src/prompts/init.py | Marks prompts as a package. |
| cais/iv_llm/src/llm/client.py | Adds a thin LLM client adapter for IV-LLM code. |
| cais/iv_llm/src/llm/init.py | Marks llm as a package. |
| cais/iv_llm/src/experiments/iv_co_scientist.py | Adds an experimental IV co-scientist runner. |
| cais/iv_llm/src/critics/independence_critic.py | Adds critic implementation using IV-LLM prompts. |
| cais/iv_llm/src/critics/exclusion_critic.py | Adds critic implementation using IV-LLM prompts. |
| cais/iv_llm/src/critics/init.py | Marks critics as a package. |
| cais/iv_llm/src/agents/hypothesizer.py | Adds IV hypothesizer agent with column-grounding logic. |
| cais/iv_llm/src/agents/confounder_miner.py | Adds confounder-miner agent with column-grounding logic. |
| cais/iv_llm/src/agents/init.py | Marks agents as a package. |
| cais/iv_llm/src/init.py | Marks IV-LLM src as a package. |
| cais/iv_llm/init.py | Adds IV-LLM log file handler configuration on package import. |
| cais/components/iv_discovery.py | Adds IV discovery component orchestrating hypothesizer + critics. |
| cais/components/dataset_cleaner.py | Improves generated-script execution plumbing and adds fallback behavior/logging. |
| cais/agent.py | Wires IV discovery into CausalAgent and adjusts controls/cleaning behavior and logging. |
| cais/init.py | Exposes iv_discovery_tool via top-level package imports. |
```python
def test_iv_llm_pipeline_app_engagement_push(self):
    """Run the full CAIS pipeline end-to-end (real API calls, no mocks)."""
```
This E2E test performs real LLM/API calls but doesn't skip when credentials are missing (unlike the other E2E tests in this repo). This will make CI fail or incur unexpected costs. Add a setUpClass that loads dotenv and raises unittest.SkipTest when the required API key env var isn't present.
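A minimal sketch of such a guard (the `OPENAI_API_KEY` name and the `python-dotenv` dependency are assumptions — match them to whatever the pipeline's LLM client actually reads):

```python
import os
import unittest

try:
    from dotenv import load_dotenv  # python-dotenv, as the other E2E tests use
except ImportError:  # keep the sketch importable without the optional dependency
    def load_dotenv():
        return None


class TestIVLLMPipelineE2E(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        load_dotenv()
        # Skip the entire class when credentials are absent, so CI stays green
        # and no paid API calls are made. The env var name is an assumption.
        if not os.getenv("OPENAI_API_KEY"):
            raise unittest.SkipTest("OPENAI_API_KEY not set; skipping real-API E2E test")
```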
```python
# Mock all helper/validation functions within diff_in_diff.py
@patch('cais.methods.diff_in_diff.identify_time_variable')
@patch('cais.methods.diff_in_diff.identify_treatment_group')
@patch('cais.methods.diff_in_diff.determine_treatment_period')
@patch('cais.methods.diff_in_diff.validate_parallel_trends')
# Mock estimate_did_model to avoid actual regression, return mock results
@patch('cais.methods.diff_in_diff.estimate_did_model')
def test_estimate_effect_structure_and_types(self, mock_estimate_model, mock_validate_trends,
```
The mocked patch paths reference cais.methods.diff_in_diff.*, which does not exist in the codebase. After fixing the import, these decorators should patch the helpers in the actual module where they are defined (e.g., cais.methods.difference_in_differences.estimator.*), otherwise the test won’t mock anything.
```python
import yaml
import json
import pandas as pd
from ..causal_analysis.correlation import analyze_data
from ..agents.human_proxy import HumanProxy
from ..agents.causal_oracle import CausalOracle
from ..agents.hypothesizer import Hypothesizer
from ..agents.confounder_miner import ConfounderMiner
from ..critics.exclusion_critic import ExclusionCritic
from ..critics.independence_critic import IndependenceCritic
from ..agents.grounder import Grounder
from ..analysis.iv_estimator import IVEstimator
from ..llm.client import LLMClient
import os


class IVCoScientist:
    def __init__(self, config):
        # Accept either config dict or config path
        if isinstance(config, str):
            with open(config, 'r') as f:
                self.config = yaml.safe_load(f)
        else:
```
This file imports yaml (PyYAML), but PyYAML is not declared in pyproject.toml dependencies (it’s only pinned in requirements.txt). If this experiment module is meant to be usable from the installed package, add PyYAML to project.dependencies or defer the import behind a runtime guard so installs don’t break when users try to run it.
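A sketch of the runtime-guard option, so a missing PyYAML fails only when YAML loading is actually requested, not at package import time (`load_config` is a hypothetical helper name):

```python
# Defer the PyYAML import: absence becomes a clear, actionable error at the
# point of use instead of an ImportError on package import.
try:
    import yaml
except ImportError:
    yaml = None


def load_config(path):
    """Load a YAML config file, with a clear error if PyYAML is absent."""
    if yaml is None:
        raise ImportError(
            "PyYAML is required to load YAML configs; install it with "
            "`pip install pyyaml` or declare it in project.dependencies."
        )
    with open(path, 'r') as f:
        return yaml.safe_load(f)
```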
```python
# Import the function to test
from cais.methods.diff_in_diff import estimate_effect
```
from cais.methods.diff_in_diff import estimate_effect will fail because there is no cais/methods/diff_in_diff.py module in the repo. Import the DiD estimator from cais.methods.difference_in_differences.estimator (or from cais.methods’ exported alias) and update the patch targets accordingly.
```python
# Set up basic logging
os.makedirs('./logs/', exist_ok=True)
logging.basicConfig(
    filename='./logs/agent_debug.log',
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
```
logging.basicConfig(...) at import time reconfigures global logging for any library consumer and for the whole test suite. Prefer leaving logging configuration to the application entrypoint/CLI; here, set a module logger and emit logs without calling basicConfig (or gate it behind if __name__ == "__main__").
Suggested change:

```diff
- # Set up basic logging
- os.makedirs('./logs/', exist_ok=True)
- logging.basicConfig(
-     filename='./logs/agent_debug.log',
-     level=logging.INFO,
-     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
- )
- logger = logging.getLogger(__name__)
+ # Set up module logger
+ logger = logging.getLogger(__name__)
+ if __name__ == "__main__":
+     # Configure basic logging only when this module is executed as a script
+     os.makedirs('./logs/', exist_ok=True)
+     logging.basicConfig(
+         filename='./logs/agent_debug.log',
+         level=logging.INFO,
+         format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+     )
```
```python
variables=self.variables,
dataset_analysis=self.dataset_analysis,
dataset_description=self.dataset_description,
original_query=query
```
IV discovery is invoked via iv_discovery_tool.func(...) without passing the agent’s configured LLM/provider/model. The current IV discovery component creates its own default LLM client, which can lead to inconsistent behavior (and failures) when the agent is configured for a non-default provider/model. Consider passing self.llm through the tool/component so all steps share the same client/config.
Suggested change:

```diff
- original_query=query
+ original_query=query,
+ llm=self.llm,
```
````text
Final Answer: [your response here]
```

DO NOT UNDER ANY CIRCUMSTANCE CALL MORE THAN ONE TOOL IN A STEP
````
The system prompt forbids multiple tool calls in a step, but the output parser’s format instructions explicitly allow repeating Action/Action Input blocks (multi-tool calls). This contradiction will confuse the model and increase parsing failures—either enforce single-action parsing or update the prompt/instructions to match the intended behavior.
Suggested change:

```diff
- DO NOT UNDER ANY CIRCUMSTANCE CALL MORE THAN ONE TOOL IN A STEP
+ In each step (each Thought/Action/Action Input/Observation block), you MUST call at most one tool. You MAY repeat this step format multiple times (with new Thoughts, Actions, Action Inputs, and Observations) before giving the Final Answer.
```
```python
# Get confounders from variables if available
confounders = variables.covariates or []
```
This tool uses variables.covariates as the confounder list for IV validation. In this codebase, confounders is a distinct field on Variables (used by propensity-score methods), so this likely drops the actual confounders identified by query_interpreter. Use variables.confounders (or a union of confounders and covariates if that’s intended) to avoid validating independence against the wrong set.
Suggested change:

```diff
- # Get confounders from variables if available
- confounders = variables.covariates or []
+ # Get confounders from variables if available.
+ # Prefer the dedicated `confounders` field, but also include any covariates.
+ base_confounders = variables.confounders or []
+ additional_covariates = variables.covariates or []
+ # Build a unified list without duplicates, preserving order.
+ confounders = list(dict.fromkeys(base_confounders + additional_covariates))
```
```python
for m in matches:
    tool_name = m.group(1).strip()
    tool_input = m.group(2).strip().strip('"')
    print('\n--------------------------')
    print(tool_input)
    print('--------------------------')
    actions.append(AgentAction(tool_name, json.loads(tool_input), text))

return actions
```
json.loads(tool_input) is called without error handling. If the model emits invalid JSON (common in practice), this will raise JSONDecodeError instead of OutputParserException, bypassing handle_parsing_errors and potentially crashing the agent. Catch JSON decode errors and raise OutputParserException with a helpful message/observation.
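A sketch of the guarded parse (the stand-in exception class keeps the sketch runnable without LangChain installed; adapt the helper to the parser's actual structure):

```python
import json

try:
    from langchain_core.exceptions import OutputParserException
except ImportError:  # keep the sketch runnable without LangChain installed
    class OutputParserException(ValueError):
        pass


def parse_tool_input(tool_input: str) -> dict:
    """Decode a model-emitted Action Input, surfacing bad JSON as a parser error.

    Raising OutputParserException instead of letting JSONDecodeError escape lets
    AgentExecutor's handle_parsing_errors feed the problem back to the model as
    an observation rather than crashing the run.
    """
    try:
        return json.loads(tool_input)
    except json.JSONDecodeError as exc:
        raise OutputParserException(
            f"Could not parse Action Input as JSON: {exc}. "
            "Re-emit the Action Input as a single valid JSON object."
        ) from exc
```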
```python
memory=memory,  # Pass the memory object
verbose=True,
callbacks=[ConsoleCallbackHandler()],  # Optional: for console debugging
handle_parsing_errors=True,  # Let AE handle parsing errors
max_retries = 100
```
AgentExecutor(...) is constructed with max_retries=100, but in LangChain 0.3.x AgentExecutor doesn’t accept a max_retries kwarg (it uses max_iterations/max_execution_time depending on version). This will raise TypeError at runtime. Use the supported limit parameters or implement retry logic outside the executor.
Suggested change:

```diff
  memory=memory,  # Pass the memory object
  verbose=True,
  callbacks=[ConsoleCallbackHandler()],  # Optional: for console debugging
  handle_parsing_errors=True,  # Let AE handle parsing errors
- max_retries = 100
+ max_iterations=100
```